Abstract

Complex event retrieval is a challenging research problem, especially when no
training videos are available. An alternative to collecting training videos is to
train a large semantic concept bank a priori. Given a text description of an event,
event retrieval is performed by selecting concepts linguistically related to the
event description and fusing the concept responses on unseen videos. However,
defining an exhaustive concept lexicon and pre-training it requires vast
computational resources. Therefore, recent approaches automate concept discovery
and training by leveraging large amounts of weakly annotated web data. Compact
visually salient concepts are automatically obtained by the use of concept pairs or,
more generally, n-grams. However, not all visually salient n-grams are necessarily
useful for an event query: some combinations of concepts may be visually compact but
irrelevant, and this drastically affects performance. We propose an event retrieval
algorithm that constructs pairs of automatically discovered concepts and then prunes
those concepts that are unlikely to be helpful for retrieval. Pruning depends both
on the query and on the specific video instance being evaluated. Our approach also
addresses calibration and domain adaptation issues that arise when applying concept
detectors to unseen videos. We demonstrate large improvements over other vision
based systems on the TRECVID MED 13 dataset.


1. Introduction 

Complex event retrieval from databases of videos is difficult because, in addition
to the challenges in modeling the appearance of static visual concepts (e.g.,
objects and scenes), modeling events also involves modeling temporal variations. In
addition to the challenges of representing motion features and time, one
particularly pernicious challenge is that the number of potential events is much
greater than the number of static visual concepts, amplifying the well-known
long-tail problem associated with object categories. Identifying and collecting
training data for a comprehensive set of objects is difficult. For complex events,
however, the task of even enumerating a comprehensive set of events is daunting, and
collecting curated training video datasets for them is entirely impractical.

*The first two authors contributed equally to this paper.

Consequently, a recent trend in the event retrieval community is to define a set of
simpler visual concepts that are practical to model and then combine these concepts
to define and detect complex events. This is often done when no examples of the
complex event of interest are available for training. In this setting, training data
is still required, but only for the more limited and simpler concepts. For example,
[?] discover and model concepts based on single words or short phrases, taking into
account how visual the concept is. Others model pairs of words or n-grams in order
to disambiguate between the multiple visual meanings of a single word [?] and take
advantage of co-occurrences present in the visual world [?]. An important aspect of
recent work [?] is that concept discovery and training set annotation are performed
automatically using weakly annotated web data. Event retrieval is performed by
selecting concepts linguistically related to the event description and computing an
average of the concept responses as a measure for event detection.

Based on recent advances, we describe a system that ranks videos based on their
similarity to a textual description of a complex event, using only web resources and
without additional human supervision. In our approach, the textual description is
represented by and detected through a set of concepts. Our approach builds on [?]
for discovering concepts given a textual description of a complex event, and on [9]
for automatically replacing the initial concepts with concept pairs that are
visually salient and capture specific visual meanings.

However, we observe that many visually salient concepts generated from an event
description are not useful for detecting the event. In fact, we find that removing
certain concepts is a key step that significantly improves event retrieval
performance. Some concepts should be removed at training time because they model
visually salient concepts that are not likely to be meaningful based on linguistic
considerations. Others should be removed if an analysis of video co-occurrences and
activation patterns indicates that a concept is likely to be irrelevant or not among
the subset of concepts that occur in a video instance. These problems are further
confounded by the fact that concept detectors are initially trained on weakly
supervised web images,¹ so there is a domain shift to video, and detector responses
are not properly calibrated.

Figure 1. Framework overview. An initial set of concepts is discovered from the web
and transformed to concept pairs using an action centric part of speech (grammar)
model. These concept pairs are used as Google Image search text queries, and
detectors are trained on the search results. Based on the detector scores on the
test videos, co-occurrence based pruning removes concepts that are likely to be
outliers. Detectors are calibrated using a rank based re-scoring method. An instance
level pruning method determines how many concepts are likely to be observed in a
video and discards the lowest scoring concepts. The scores of the remaining concepts
are fused to score each video. Motion features of the top ranked videos are used to
train an SVM and update the video list. Finally, the initial detectors are re-trained
using the top ranked videos of this video list, and the process of co-occurrence
based pruning, instance level pruning and rank based calibration is repeated to
re-score the videos.

Our contribution is a fully automatic algorithm that discovers concepts that are not
only visually salient, but are also likely to predict complex events, by exploiting
co-occurrence statistics and activation patterns of concepts. We address domain
adaptation and calibration issues in addition to modeling the temporal properties.
Evaluations are conducted using the TRECVID EK0 dataset, where our system
outperforms state-of-the-art methods based on visual information.

¹ We prefer to use web images for concept training because a web search is a weak
form of supervision which provides no spatial or temporal localization. This means
that if we search for video examples of a concept, we do not know how many and which
frames contain the concept (a temporal localization issue), while an image result is
much more likely to contain the concept of interest (the spatial localization issue
still remains).


2. Related Work

Large scale video retrieval commonly employs a concept-based video representation
(CBRE), especially when only few or no training examples of the events are
available. In this setting, complex events are represented in terms of a large set
of concepts that are either event-driven (generated once the event description is
known) [5] [15] [21] or pre-defined [?] [?]. A test query description is mapped to a
set of concepts whose detectors are then applied to videos to perform retrieval.
However, methods based on pre-defined concepts need to train an exhaustive set of
concept detectors a priori; otherwise, the semantic gap between the query description
and the concept database might be too large. Such exhaustive training is
computationally expensive and currently infeasible for real-world video retrieval
systems. Instead, in this paper, given the textual description of the event to be
retrieved, our approach leverages web image data to discover event-driven concepts
and train detectors that are relevant to this specific event.

Recently, web (Internet) data has been widely used for knowledge discovery
[?] [2] [5] [29] [10] [14]. Chen et al. [?] use web data to weakly label images and
to learn and exploit common sense relationships. Berg et al. [?] automatically
discover attributes from unlabeled Internet images and their associated textual
descriptions. Duan et al. [?] describe a system that uses a large amount of weakly
labeled web videos for visual event recognition, based on a measure of the distance
between two videos and a new transfer learning method. Habibian et al. [?] obtain
textual descriptions of videos from the web and learn a multimedia embedding for
few-example event recognition. For concept training, given a list of concepts, each
corresponding to a word or short phrase, web search is commonly used to construct
weakly annotated training sets [?] [29] [9]. We use the concept name as a query to a
search engine, and train the concept detector based on the returned images.

Moreover, retrieval performance depends on high quality concept detectors. While the
performance of a concept detector can be estimated (e.g., by cross-validation [?]),
ambiguity remains in associating linguistic concepts with visual concepts. For
example, groom in grooming an animal and groom in wedding ceremony are totally
different, and while two separate detectors might be capable of modeling both types
of groom separately, a single groom detector would likely perform poorly. Similarly,
tire images from the web are different from frames containing tires in a video about
changing a vehicle tire, since there are often people and cars in these frames. To
solve this problem, [?] use an n-gram model to differentiate between multiple senses
of a word. Habibian et al. [?] instead leverage logical relationships (e.g., "OR",
"AND", "XOR") between two concepts. Mensink et al. [?] exploit label co-occurrence
statistics to address zero-shot image classification. However, it is not sufficient
to discover visually distinctive concepts, since not all concepts are equally
informative for modeling events. We present a pruning process that retains concepts
which are both visually distinctive and useful for the event.

Recent work has also explored multiple modalities (e.g., automatic speech
recognition (ASR), optical character recognition (OCR), audio, and vision) for event
detection [16] [?] [?] to achieve better performance than vision alone. Jiang et
al. [?] propose MultiModal Pseudo Relevance Feedback (MMPRF), which selects several
feedback videos for each modality to train a joint model. Applied to test videos,
the model yields a new ranked video list that is used as feedback to retrain the
model. Wu et al. [29] represent a video by using a large concept bank, speech
information, and video text. These features are projected to a high-dimensional
concept space, where event/video similarity scores are computed to rank videos.
While multi-modal techniques achieve good performance, their visual components alone
significantly under-perform the system as a whole.

All these methods suffer from calibration and domain adaptation issues, since CBRE
methods fuse multiple concept detector responses and are usually trained and tested
on different domains. To deal with calibration issues, most related work uses SVMs
with probabilistic outputs [20]. However, the domain shift between web training data
and test videos is usually not addressed by calibration alone. To reduce this
effect, some ranking-based re-scoring schemes [?] [?] replace raw detector
confidences with the confidence rank in a list of videos. To further adapt to new
domains (e.g., from images to videos), easy samples have been used to update
detector models [?]. Similar to these approaches, we use a rank based re-scoring
scheme to address calibration issues and update models using the most confident
detections to adapt to new domains.

3. Overview 

The framework of our algorithm is shown in Fig. 1. Given an event defined as a text
query, our algorithm retrieves and ranks videos by relevance. The algorithm first
constructs a bank of concepts by the approach of [?] and transforms it into concept
pairs. These concept pairs are then pruned by a part of speech model. Each remaining
concept pair is used as a text query in a search engine (Google Images), and the
returned images are used to train detectors, which are then applied to the test
videos. Based on detector responses on test videos, co-occurrence based pruning
removes concept pairs that are likely to be outliers. Detectors are then calibrated
using a rank based re-scoring method. An instance level pruning method determines
how many concept pairs should be observed in a video from the class, discarding the
lowest scoring concepts. The scores of the remaining concept pairs are fused to rank
the videos. Motion features of the top ranked videos are then used to train an SVM
and re-rank the video list. Finally, the top ranked videos are used to re-train the
concept detectors, and we use these detectors to re-score the videos.

The following sections describe each part of our approach in detail.

4. Concept Discovery 

The concept discovery method of [?] exploits weakly tagged web images and yields an
initial list of concepts for an event. Most of these visual concepts correspond to
single words, so they may suffer from ambiguity between linguistic and visual
concepts. Consequently, we follow [?] by using n-grams to model specific visual
meanings of linguistic concepts and [?] by using co-occurrences. From the top P
concepts in the list provided by [?], we combine single-word concepts into pairs and
retain the phrase concepts to form a new set of concepts. The resulting concepts
reduce visual ambiguity and are more informative. We refer to the concepts trained
on pairs of words as pair-concepts.
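The pairing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_pair_concepts` is hypothetical, a concept is treated as a phrase whenever it contains a space, and word order within a pair simply follows the order of the input list.

```python
from itertools import combinations


def build_pair_concepts(concepts):
    """Pair up single-word concepts and keep multi-word phrases unchanged.

    Assumption: a concept containing a space is a phrase and is retained as is.
    """
    singles = [c for c in concepts if " " not in c]
    phrases = [c for c in concepts if " " in c]
    # Every unordered pair of distinct single-word concepts becomes one pair-concept.
    pairs = [f"{a} {b}" for a, b in combinations(singles, 2)]
    return pairs + phrases


if __name__ == "__main__":
    print(build_pair_concepts(["jump", "bicycle", "trick", "bike trick"]))
```

With P = 10 single-word concepts, this enumeration yields 10 choose 2 = 45 candidate pairs before any pruning.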

Fig. 2 shows the frames ranked highest by the proposed pair-concept detectors, the
original concept detectors for single words, and the sum of two independently
trained concept detectors on the words constituting the pair-concept. Pair-concept
detectors are more relevant to the event than the unary detectors or the sum of two
detectors. For example, in Fig. 2 the event query is attempting a bike trick, and
two related concepts are jump and bicycle. The jump detector can only detect a few
instances of jumping, none of which are typical of a bike trick. The bicycle
detector successfully detects bicycles, but most detections are of people riding
bicycles instead of performing a bike trick. If the two detectors are combined by
adding their scores, some frames with bikes and jump actions are obtained, but they
are still not relevant to bike trick. However, the jump bicycle detections are much
more relevant to attempting a bike trick: people riding a bicycle are jumping off
the ground.

Figure 2. Top five ranked videos by different concept detectors trained using web
images for three events: (a) attempting a bike trick, (b) changing a vehicle tire,
(c) getting a vehicle unstuck. The first and second rows show the results of running
unary concepts on test videos. The third row combines two unary concept detectors by
adding their scores. The fourth row shows the results of our proposed pair-concept
detectors. Pair-concepts are more effective at discovering frames that are more
semantically relevant to the event.

Concepts which do not result in good visual models (e.g., cute water, dancing blood)
can be identified [?]. But even when concepts lead to good visual models, they might
still not be informative (e.g., car truck and puppy dog). Moreover, even if concepts
are visual and informative, videos do not always exhibit all concepts related to an
event, so expecting all concepts to be observed will reduce retrieval precision. For
these reasons, it is not only necessary to select concepts that can be modeled
visually, but also to identify subsets of them that are useful to the event
retrieval task. We propose three concept pruning schemes to remove bad concepts:
pruning based on grammatical parts of speech, pruning based on co-occurrence on test
videos, and instance level pruning. The first two schemes remove concepts that are
unlikely to be informative, while the last identifies a subset of relevant concepts
for each video instance.

4.1. Part of speech based pruning 

Action centric concepts are effective for video recognition, as shown in [?]. Based
on this, we require that a pair-concept contain one of three types of action centric
words: 1) nouns that are events, e.g., party, parade; 2) nouns that are actions,
e.g., celebration, trick; 3) verbs, e.g., dancing, cooking, running. Word types are
determined by their lexical information and frequency counts provided by WordNet
[?]. Then, action centric concepts are paired with other concepts that are not
action centric to yield the final set of pair-concepts.
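As a rough illustration of this rule, the sketch below pairs every action centric word with every remaining word. The word-type labels are supplied by hand here and are purely hypothetical; the paper derives them from WordNet lexical information and frequency counts, which this sketch does not attempt to reproduce. Placing the action centric word first is likewise an assumption.

```python
from itertools import product

# Hypothetical word-type labels standing in for the WordNet-based typing.
ACTION_TYPES = {"event_noun", "action_noun", "verb"}


def action_centric_pairs(word_types):
    """Pair each action centric word with each non action centric word.

    word_types maps a single-word concept to its (assumed) word type.
    """
    action = [w for w, t in word_types.items() if t in ACTION_TYPES]
    other = [w for w, t in word_types.items() if t not in ACTION_TYPES]
    return [f"{a} {o}" for a, o in product(action, other)]


if __name__ == "__main__":
    types = {
        "jump": "verb",
        "trick": "action_noun",
        "bicycle": "object_noun",
        "ramp": "object_noun",
    }
    print(action_centric_pairs(types))
```

Pairs made of two non action centric words (here, bicycle ramp) are never generated, which is what shrinks the candidate set before any detector is trained.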

Table 1 shows the pair-concepts discovered for an event. Qualitatively, these
concepts are more semantically relevant to events than the single word concepts from
[?]. An improvement would be to learn the types of pair-concepts that lead to good
event models, based on their parts of speech. However, as our qualitative and
quantitative results show, the proposed action-centric pruning rule leads to
significant improvements over using all pairs, so we leave data-driven learning for
future work.

Pair-concept detectors are trained automatically using web images. For each concept,
200 images are chosen as positive examples, downloaded by using the concept as the
textual query for image search on Google Images. Then, 500 negative examples are
randomly chosen from the images of other concepts from all events. Based on the deep
features [?] of these examples, the detectors are trained as RBF kernel SVMs using
LibSVM [?] with the default parameters.

4.2. Co-occurrence based pruning 

Not all action-centric pair-concepts will be useful, for a number of reasons. First,
the process of generating unary concepts from an event description is uncertain [?],
and might generate irrelevant ones. Second, even if both unary concepts are relevant
individually, they may lead to nonsensical pairs. Finally, even if both unary
concepts are relevant, web search sometimes returns irrelevant images which can
pollute the training of concept detectors.

To reduce the influence of visually unrelated and noisy concepts, we search for
co-occurrences between detector responses and keep only pair-concepts whose detector
outputs co-occur with other pair-concepts at the video level. The intuition is that
co-occurrences between good concepts will be more frequent than coincidental
co-occurrences between bad concepts. One reason for this is that if two
pair-concepts are both relevant to the same complex event, they are more likely to
fire in a video of that event. Another reason is that detectors are formed from
pairs of concepts, so many pair-concepts will share a unary concept and so are
likely to be semantically similar to some extent. For example, cleaning kitchen and
washing kitchen share kitchen. In other cases, pair-concepts may share visual
properties as they are derived for a specific event. For example, stuck car and
stuck tire can co-occur because a tire can be detected along with a car in a frame
or in a video.

Figure 3. Example of co-occurrence based concept pruning. The five rows correspond
to the top 15 videos retrieved by five concept detectors (stuck car, stuck tire,
stuck truck, stuck winter, stuck night) for detecting the event getting a vehicle
unstuck. Frames from the same videos are marked with bounding boxes of the same
color, and repeating colors across concept detectors denote co-occurrences. For
example, the yellow border in rows corresponding to stuck car, stuck tire, and stuck
truck signifies that all three concept detectors co-occur in the video represented
by the yellow color. The solid boxes denote the positive videos, while the dashed
ones are negatives. Stuck night and stuck winter do not co-occur often with other
concepts, so they are discarded. Note that the negatives are all marked with a red
cross in the upper-right corner.

Let $V = \{V_1, V_2, \ldots, V_N\}$ denote the videos in the test dataset, where
$V_i$ contains $N_i$ frames (sampled from the video for computational reasons).
Given $K$ concept detectors $\{D_1, D_2, \ldots, D_K\}$ trained on web images, each
concept detector is applied to $V$. For each video $V_i$, an $N_i \times K$ response
matrix $S_i$ is obtained. Each element in $S_i$ is the confidence of a concept for a
frame. After employing a hierarchical temporal pooling strategy (described in
Section 5) on the response matrix, we obtain a confidence score $S_{ik}$ for detector
$D_k$ applied to video $V_i$. Then, for each concept detector $D_k$, we rank the $N$
videos in the test set based on the score $S_{ik}$. Let $L_k$ denote the top $M$
ranked videos in the ranking list. We construct a co-occurrence matrix $C$ as
follows:

$$
C_{ij} =
\begin{cases}
\dfrac{|L_i \cap L_j|}{M}, & 1 \le i, j \le K,\ i \ne j \\
0, & i = j,
\end{cases}
\qquad (1)
$$

where $|L_i \cap L_j|$ is the number of videos common to $L_i$ and $L_j$. A concept
detector $D_i$ is said to co-occur with another detector $D_j$ if $C_{ij} > t$,
where $t$ is between 0 and 1. A concept is discarded if it does not co-occur with at
least $c$ other concepts.
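The pruning rule can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the overlap is assumed to be normalized by M (so that the threshold t lies between 0 and 1), and "does not co-occur with c other concepts" is read as "co-occurs with fewer than c".

```python
def top_m(scores, m):
    """Indices of the m highest-scoring videos for one detector."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:m])


def cooccurrence_prune(video_scores, m, t, c):
    """Return the indices of detectors that survive co-occurrence pruning.

    video_scores[k][i] is the pooled score of detector k on video i.
    Assumption: overlap is |L_i & L_j| / m and a detector is kept when it
    co-occurs (overlap > t) with at least c other detectors.
    """
    tops = [top_m(s, m) for s in video_scores]
    keep = []
    for i in range(len(tops)):
        partners = sum(
            1
            for j in range(len(tops))
            if j != i and len(tops[i] & tops[j]) / m > t
        )
        if partners >= c:
            keep.append(i)
    return keep


if __name__ == "__main__":
    scores = [
        [0.9, 0.8, 0.7, 0.1, 0.2, 0.0],  # top 3 videos: 0, 1, 2
        [0.8, 0.9, 0.1, 0.7, 0.2, 0.0],  # top 3 videos: 1, 0, 3
        [0.0, 0.1, 0.2, 0.3, 0.9, 0.8],  # top 3 videos: 4, 5, 3
    ]
    print(cooccurrence_prune(scores, m=3, t=0.5, c=1))
```

In this toy input the first two detectors share two of their top three videos, so they support each other, while the third shares at most one video with either and is dropped.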

An example is shown in Fig. 3. Here, the top 15 ranked videos retrieved by five
different concept detectors for the event getting a vehicle unstuck are shown. The
stuck winter detector co-occurs with other detectors in only one of the top 15
videos, and the stuck night detector does not co-occur with any other detector, so
these two detectors are discarded. Also, fewer positive examples of the complex
event are retrieved by the two discarded detectors than by the other three,
suggesting that the co-occurrence based pruning strategy is effective in removing
concepts which are outliers. After pruning some concepts using co-occurrence
statistics, we fuse the scores of good concepts by taking the mean score of these
concepts and rank the videos in the test set using this score.

4.3. Instance Level Pruning 

Although many concepts may be relevant to an event, it is not likely that all
concepts will occur in a single video. This is because not all complex event
instances exhibit all related concepts, and not all concept instances are detected
even if they are present (due to computer vision errors). Therefore, computing the
mean score of all concept detectors for ranking is not a good solution, and we need
to predict an event when only a subset of these concepts is observed. However, the
subset is video instance specific, and knowing all possible subsets a priori is not
feasible with no training samples. Even though these subsets cannot be determined,
we can estimate the average cardinality of the set based on the detector responses
observed for the top M ranked videos after computing the mean score of detectors.
For each event, the number of relevant concepts is estimated as:


$$
N_r = K - \min\left\{ \left\lceil \frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \mathbb{1}(S_{ik} < T) \right\rceil,\ \lambda \right\} \qquad (2)
$$

where $\mathbb{1}(\cdot)$ is the indicator function: it is 1 if the confidence score
of concept $k$ in video $V_i$ is less than a detection threshold $T$ (i.e., detector
$D_k$ does not detect concept $k$ in video $V_i$) and 0 otherwise.
$\lceil \cdot \rceil$ is the ceiling function, and $\lambda$ is a regularizer that
controls the maximum number of concepts to be pruned for an event. The ceiling term
is the average number of undetected concepts in the top $M$ ranked videos, so $N_r$
estimates the average number of detected concepts. When combining the concept
scores, we keep only the top $N_r$ responses and discard the rest.
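A small sketch of instance level pruning is given below. It assumes the estimate takes the form N_r = K - min(ceil(average number of undetected concepts over the top M videos), lambda), which is our reading of the description above, and it assumes the kept responses are fused by their mean, as in Section 4.2. Function names and the toy numbers are hypothetical.

```python
import math


def num_relevant_concepts(top_scores, threshold, lam):
    """Estimate N_r from the top-M videos.

    top_scores[i][k] is the score of concept k on the i-th top-ranked video.
    Assumption: lam < K, so at least one concept is always kept.
    """
    m = len(top_scores)
    k = len(top_scores[0])
    # Count (video, concept) pairs whose score falls below the detection threshold.
    missed = sum(1 for row in top_scores for s in row if s < threshold)
    return k - min(math.ceil(missed / m), lam)


def fuse_top(scores, n_r):
    """Mean of the n_r highest concept scores of a single video."""
    best = sorted(scores, reverse=True)[:n_r]
    return sum(best) / len(best)


if __name__ == "__main__":
    top = [
        [0.9, 0.8, 0.1, 0.2],
        [0.7, 0.1, 0.6, 0.3],
        [0.8, 0.9, 0.2, 0.1],
    ]
    n_r = num_relevant_concepts(top, threshold=0.5, lam=3)
    print(n_r, fuse_top(top[0], n_r))
```

Each toy video misses two of the four concepts, so two responses are kept per video; which two are kept differs from video to video, which is the point of doing the pruning per instance.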

































Table 1. Concepts discovered after different pruning strategies

Event: Working on a metal crafts project
  Initial concepts:
    art, bridge, iron, metal, new york, new york city, united state, work, worker
  After part of speech based pruning:
    iron art, iron bridge, iron craft, metal art, metal bridge, metal craft, new york,
    new york city, united state, work iron, work metal, work worker, worker art,
    worker bridge, worker craft
  After co-occurrence based pruning:
    iron art, iron craft, metal art, metal bridge, metal craft, work iron, work metal

Event: Dog show
  Initial concepts:
    animal, breed, car, cat, dog, dog show, flower, pet, puppy, show
  After part of speech based pruning:
    animal pet, breed animal, breed car, breed cat, breed dog, breed flower,
    breed puppy, car pet, cat pet, dog pet, dog show, flower pet, puppy pet,
    show animal, show car, show cat, show dog, show flower, show puppy
  After co-occurrence based pruning:
    animal pet, breed animal, breed car, breed cat, breed dog, breed puppy, cat pet,
    dog pet, dog show, puppy pet, show cat, show dog, show puppy

Event: Parade
  Initial concepts:
    city, gay, gay pride, gay pride parade, new york, new york city, nyc event,
    parade, people, pride
  After part of speech based pruning:
    city parade, gay city, gay people, gay pride, gay pride parade, new york,
    new york city, nyc event, people parade, pride parade
  After co-occurrence based pruning:
    city parade, gay pride, gay pride parade, people parade, pride parade


5. Hierarchical Temporal Pooling 

Our concept detectors are frame-based, so we need a strategy to model the temporal
properties of videos. A common strategy is to treat the video as a bag, pooling all
responses by the average or max operator. However, max pooling tends to amplify
false positives; at the other extreme, average pooling would be robust against
spurious detections, but expecting a concept to be detected in many frames of a
video is not realistic and would lead to false negatives. As a compromise, we
propose hierarchical temporal pooling, where we perform max pooling within sub-clips
and average over sub-clips over a range of scales. Note that the top level of this
hierarchy corresponds to max pooling, the bottom level corresponds to average
pooling, and the remaining levels correspond to something in between. The score for
a concept k in video V_i is computed as follows.


$$
S_{ik} = \sum_{n=1}^{l} \sum_{j=1}^{n} \frac{m_{nj}}{n} \qquad (3)
$$

where $l$ is the maximum number of parts into which a video is partitioned (a scale
at which the video is analyzed), and $m_{nj}$ is the max pooling score of the
detector in part $j$ of the video partitioned into $n$ equal parts. Temporal pooling
has been widely used in action recognition [?] for representing space-time features.
In contrast, we perform temporal pooling over SVM scores, instead of pooling low
level features.
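The pooling in Eq. (3) can be sketched as below. The function name is hypothetical, parts are made "equal" only up to integer rounding of the frame boundaries, and the number of levels is clamped to the number of frames so that no part is empty; these are choices of this sketch, not details taken from the paper.

```python
def hierarchical_pool(frame_scores, levels):
    """Sum over n = 1..levels of the average of per-part maxima.

    At level n the frame sequence is split into n near-equal contiguous
    parts, each part is max pooled, and the n maxima are averaged.
    """
    num_frames = len(frame_scores)
    levels = min(levels, num_frames)  # avoid empty parts on very short videos
    total = 0.0
    for n in range(1, levels + 1):
        maxima = [
            max(frame_scores[j * num_frames // n:(j + 1) * num_frames // n])
            for j in range(n)
        ]
        total += sum(maxima) / n
    return total


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.2, 0.4]
    print(hierarchical_pool(scores, levels=1))  # level 1 alone is plain max pooling
    print(hierarchical_pool(scores, levels=2))
```

With a single level the result is the max over all frames; when the number of parts equals the number of frames, that level contributes the plain average, so intermediate levels interpolate between the two behaviours.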


6. Domain Adaptation 

Score Calibration. The detectors are trained on web images, so their scores are not
reliable because of the domain shift between the web and video domains. In addition,
each detector may have different response characteristics on videos, e.g., one
detector is generic and has a high response for many videos, while another detector
is specific and has a high response only for a few videos. Thus we calibrate their
responses before fusion as follows:


$$
S'_{ik} = \frac{1}{1 + \exp\left( R_{ik} / u \right)} \qquad (4)
$$

where $S'_{ik}$ is the calibrated score, $R_{ik}$ is the rank of video $V_i$ when
generating the rank list only using concept detector $D_k$, and $u$ controls the
decay factor in the exponential. This re-scoring function not only calibrates raw
detector scores, but also gives a much higher score to highly ranked samples while
ignoring the lower ranked ones, which is appropriate for retrieval.
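A minimal sketch of rank based re-scoring follows. The exact functional form is an assumption of this sketch: it uses 1 / (1 + exp(rank / u)) with rank 1 for the best video, which matches the description of an exponential decay controlled by u but may differ in detail from the authors' formula. The function name is hypothetical.

```python
import math


def rank_calibrate(scores, u):
    """Replace raw scores of one detector by a decaying function of their rank.

    Assumed form: 1 / (1 + exp(rank / u)), rank 1 being the highest raw score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    calibrated = [0.0] * len(scores)
    for rank, i in enumerate(order, start=1):
        # exp(-x) / (1 + exp(-x)) equals 1 / (1 + exp(x)) and cannot overflow for x > 0.
        e = math.exp(-rank / u)
        calibrated[i] = e / (1.0 + e)
    return calibrated


if __name__ == "__main__":
    raw = [0.2, 5.0, -1.0]  # raw SVM scores on an arbitrary scale
    print(rank_calibrate(raw, u=1.0))
```

Because only ranks enter the formula, two detectors with very different raw score ranges produce calibrated scores on the same scale, which is what makes their fusion meaningful.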

Detector Retraining. Based on the domain adaptation approach of [?], we use
pseudo-relevance from top-ranked videos to improve performance. Since web detectors
only capture static scene/object cues in a video, it is beneficial to extract Fisher
Vectors (FV) on Improved Dense Trajectory (IDT) features [?] to capture the motion
cues. Based on the rank list obtained from concept detectors, we train a linear SVM
using LIBLINEAR [?] on the top ranked videos, using the extracted Fisher Vectors as
features. The lowest ranked videos are used as negative samples. The resulting
motion-based detectors are then applied to the test videos. Finally, we use late
fusion to combine the detection scores obtained using motion features with those of
the web detectors.
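The pseudo-relevance step amounts to picking pseudo-positive and pseudo-negative videos from the current ranking. The sketch below shows only that selection; the function name is hypothetical, and the SVM training on Fisher Vectors and the late fusion are deliberately left out because the text does not specify them in enough detail to reproduce.

```python
def pseudo_labels(fused_scores, n_pos, n_neg):
    """Return (positives, negatives) as lists of video indices.

    Positives are the n_pos highest-scoring videos, negatives the n_neg
    lowest-scoring ones; the two sets are required to be disjoint.
    """
    order = sorted(
        range(len(fused_scores)), key=lambda i: fused_scores[i], reverse=True
    )
    if n_pos + n_neg > len(order):
        raise ValueError("positive and negative sets would overlap")
    return order[:n_pos], order[len(order) - n_neg:]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.5, 0.3, 0.7]
    print(pseudo_labels(scores, n_pos=2, n_neg=2))
```

The experiments use the top 50 and bottom 5,000 ranked videos for this purpose; the strong imbalance reflects that only the very top of the list is trusted to contain true positives.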

We further adapt the concept detectors to the video domain by retraining them on
frames from top-ranked videos. For each detector, we obtain frames with the highest
response in the top ranked videos (after fusion with motion features) to train a
concept detector (with the constraint that similar frames should not be selected
twice, to encourage diversity). We then repeat the process of co-occurrence based
pruning, instance level pruning and rank based calibration to fuse the scores for
the new concept detectors. Finally, the video scores are updated by summing the
fused scores (original concept detectors + IDT) and the scores of adapted concept
detectors.

Figure 4. Average Precision (AP) scores of initial concepts, all pair-concepts, and
the concepts after part of speech based and co-occurrence pruning are shown for the
event "Getting a vehicle unstuck". AP after combining the concepts is also reported.
Note that part of speech pruning helps in removing many pair-concepts with low AP.
Moreover, co-occurrence based pruning removes the two lowest performing
pair-concepts and significantly improves on the AP obtained after part of speech
pruning.

7. Experiments and Results 

We perform experiments on the challenging TRECVID Multimedia Event Detection (MED)
2013 dataset. We first verify the effectiveness of each component of our approach,
and then show the improvement on the EK0 dataset by comparing with state-of-the-art
methods.

7.1. Dataset and Implementation Details 

The TRECVID MED 2013 EK0 dataset consists of unconstrained Internet videos collected
by the Linguistic Data Consortium from various Internet video hosting sites. Each
video contains only one complex event or content not related to any event. There are
20 complex events in total in this dataset, with ids 6-15 and 21-30. These event
videos, together with background videos (around 23,000 videos), form a test set of
24,957 videos. In the EK0 setting, no ground-truth positive training videos are
available. We apply our algorithm on the test videos, and the mAP score is
calculated based on the video ranking.
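For reference, mean average precision over ranked lists can be computed as in the sketch below. It uses the standard non-interpolated definition of average precision and assumes every positive video appears somewhere in the ranked list (true here, since the whole test set is ranked); the official TRECVID scoring tool may differ in details, and the function names are hypothetical.

```python
def average_precision(ranked_labels):
    """Non-interpolated AP of one ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for position, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / position  # precision at this positive
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_event_labels):
    """Mean of the per-event average precisions."""
    aps = [average_precision(labels) for labels in per_event_labels]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    event_a = [1, 0, 1, 0]  # positives at ranks 1 and 3
    event_b = [0, 1, 0, 0]  # single positive at rank 2
    print(mean_average_precision([event_a, event_b]))
```

Since positives are a small fraction of the 24,957 test videos, AP is dominated by how early the few true positives appear, which is why pruning noisy concepts has such a visible effect on the metric.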

For each event in the EK0 dataset, we choose the top 10 concepts (i.e., P = 10) in
the list provided by [?] and transform them into pair-concepts. The web image data
on which concept detectors are trained is obtained by image search on Google Images
using each pair-concept as a query. The Type option is set to Photo to filter out
irrelevant cartoon images. We downloaded around 200 images for each concept pair
query. We sample each video every two seconds to obtain a set of frames. Then, we
use Caffe [?] to extract deep features on all the frames and web images, using the
model pre-trained on ImageNet. We used the fc7 layer after dropout, which generates
a 4,096 dimensional feature for each video frame or image. The hyper-parameters (the
threshold t that determines whether a concept co-occurs with another, the length M
of the intersection list, the regularization constant λ, and the detection threshold
T) are selected by leave-one-event-out cross validation, since they should be
validated with event categories different from the one being retrieved. A
sensitivity analysis showed the results to be robust to these hyper-parameters. The
number of levels l in hierarchical temporal pooling was set to 5. The Fisher Vectors
of the top 50 ranked videos and the bottom 5,000 ranked videos are used to train a
linear SVM.


Table 2. Comparative results

Pre-Defined   Method                     mAP
-----------   ------------------------   ------
No            Concept Discovery [15]     2.3%
Yes           SIN/DCNN 03                2.5%
Yes           CD+WSC [29]                6.12%
Yes           Composite Concept [13]     6.39%
No            Initial concepts           4.91%
No            All Pair-concepts          7.54%
No            +Part of speech pruning    8.61%
No            +Cooc & inst pruning       10.85%
No            +Adaptation                11.81%


7.2. Evaluation on MED 13 EK0

Figure 5. Mean average precision (mAP) scores for the events on the MED 13 EK0 dataset. By pruning concepts that are not useful for
retrieving the complex event of interest, our approach progressively improves the utility of the remaining concepts.

Table shows the initial list of concepts, the concepts
that remain after part of speech based pruning, and the
concepts that remain after co-occurrence based pruning for
three different events. Although the initial concepts are
related to the event, web queries corresponding to them would
provide very generic search results. Since we have 10 unary
concepts per event, there are 45 unique pair-concept
detectors for each event. Approximately 10-20 pair-concepts
remain after part of speech based pruning. This significantly
reduces the computational burden and also prunes away noisy
pairs. Finally, co-occurrence based pruning discards
additional outliers in the remaining pair-concepts.
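
The count of 45 pair-concepts follows from choosing 2 of the 10 unary concepts without regard to order, i.e. 10 * 9 / 2 = 45. A short check, with placeholder concept names that are not taken from the experiments:

```python
from itertools import combinations
from math import comb

# Hypothetical unary concepts for one event; the names are placeholders.
unary = [f"concept_{i}" for i in range(10)]

# Unordered pairs of distinct concepts (no self-pairs, no duplicates).
pairs = list(combinations(unary, 2))

print(len(pairs))   # 45
print(comb(10, 2))  # 45
```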

Table 2 shows the results of our method on the
TRECVID EK0 dataset. We observe significant performance
gains (5.4% absolute: 11.81% vs. 6.39%) over other vision
based methods which do not use any training samples. Our
mAP is roughly 2-5 times theirs. Note that the
methods based on pre-defined concepts must bridge the
semantic gap between the query specification and the
pre-defined concept set. In contrast, we leverage the
web to discover concepts. Our approach follows the same
protocol as which performs the same task. Using the
same initial concepts as Q, our method obtains 5 times
the mAP as that of a. Fig. 5 shows the effect of each
stage in the pipeline. Replacing the initial set of
concepts by action based pair-concepts provides the maximum
gain in performance of ~3.7% (4.9% to 8.61%). Next,
co-occurrence based pruning improves the mAP by 1.8%
(8.61% to 10.4%). Calibration of detectors and instance
level pruning improves the mAP score to 10.85%. Finally,
adapting each detector on the test dataset and using motion
information allows us to reach a mAP of 11.81%. The
performance is low for events 21 to 30 because there are only ~25
videos for these events, while events 6-15 have around 150
videos each in the test set.
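
The per-stage gains and the ratios to prior methods quoted above can be re-derived from the reported numbers. The snippet below is only an arithmetic check; the stage labels are shorthand introduced here, and the 10.4% value comes from the text rather than from Table 2.

```python
# mAP (%) after each stage, copied from Table 2 and from the paragraph above.
stages = [
    ("Initial concepts", 4.91),
    ("+Part of speech pruning", 8.61),
    ("+Co-occurrence pruning", 10.4),
    ("+Calibration and instance pruning", 10.85),
    ("+Adaptation", 11.81),
]

# Absolute gain of each stage over the previous one, in percentage points.
gains = [
    (name, round(cur - prev, 2))
    for (_, prev), (name, cur) in zip(stages, stages[1:])
]
for name, gain in gains:
    print(f"{name}: +{gain:.2f}")

# Ratio of the final mAP to the weakest and strongest prior methods in Table 2.
baselines = {"Concept Discovery": 2.3, "Composite Concept": 6.39}
ratios = {name: round(11.81 / value, 2) for name, value in baselines.items()}
print(ratios)
```

The ratios of roughly 5.1 and 1.8 correspond to the "2-5 times" range stated in the text.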

To illustrate that the proposed pruning methods remove
concepts with low AP, in Fig. we plot the AP scores of the
initial unary concepts, all pair-concepts, the concepts
remaining after part of speech based pruning, and the concepts
remaining after co-occurrence based pruning. Note that almost
50% of the pair-concepts had an average precision below 10%
before pruning. After part of speech and co-occurrence based
pruning, our approach removes all of these low scoring
concepts in this example.

We note that Hierarchical Temporal Pooling provides a
significant improvement in performance for this task.
In Table 3 we show mAP scores for different pooling methods
for initial concepts, all pair-concepts, and the concepts
after pruning (before Detector Retraining). Hierarchical
Temporal Pooling improves performance in all three cases. We
also observe that the concepts after pruning have the best
performance across all pooling methods.


Table 3. Pooling Results

                Initial   All Pairs   After Pruning
Avg. Pooling    2.84%     4.54%       5.94%
Max. Pooling    4.45%     6.87%       9.01%
Hierarchical    4.91%     7.54%       10.85%
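
To make the three pooling strategies in Table 3 concrete, the sketch below pools per-frame detector scores into one video-level score. Average and max pooling are standard. The hierarchical variant is only an assumed pyramid-style reading (at level k the frames are split into up to 2^k contiguous segments, the maximum is taken within each segment, and segment and level values are averaged); it is not necessarily the exact definition used in this paper, and the function names are hypothetical.

```python
from typing import List, Sequence


def avg_pool(scores: Sequence[float]) -> float:
    """Mean of the per-frame scores."""
    return sum(scores) / len(scores)


def max_pool(scores: Sequence[float]) -> float:
    """Highest per-frame score."""
    return max(scores)


def hierarchical_pool(scores: Sequence[float], levels: int = 5) -> float:
    """Assumed pyramid scheme; not necessarily the paper's exact definition."""
    n = len(scores)
    level_values: List[float] = []
    for level in range(levels):
        parts = min(2 ** level, n)  # never more segments than frames
        # Integer segment boundaries; every segment holds at least one frame.
        bounds = [i * n // parts for i in range(parts + 1)]
        seg_max = [max(scores[a:b]) for a, b in zip(bounds, bounds[1:])]
        level_values.append(sum(seg_max) / len(seg_max))
    return sum(level_values) / len(level_values)


frame_scores = [0.1, 0.9, 0.2, 0.4]  # toy scores for one concept on one video
print(avg_pool(frame_scores), max_pool(frame_scores), hierarchical_pool(frame_scores))
```

Under this reading, the coarsest level coincides with max pooling and the finest levels approach average pooling, so the pooled score always lies between the two.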


8. Conclusion 

We demonstrated that carefully pruning concepts can 
significantly improve performance for event retrieval when 
no training instances of an event are available, because even 
if concepts are visually salient, they may not be relevant to a 
specific event or video. Our approach does not require man¬ 
ual annotation, as it obtains weakly annotated data through 
web search, and is able to automatically calibrate and adapt 
trained concepts to new domains. 